Skip to content

[5.7] Character class and scalar coalescing fixes #588

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Jul 21, 2022

Conversation

hamishknight
Copy link
Contributor

@hamishknight hamishknight commented Jul 19, 2022

5.7 cherry-pick of #570 + #574 + #559

Resolves rdar://97386173

@hamishknight hamishknight added the r5.7 5.7 Release Cherry Picks label Jul 19, 2022
@hamishknight
Copy link
Contributor Author

  • Explanation: Fixes the matching behavior of custom character classes. This includes properly coalescing adjacent scalars to produce correct grapheme semantics, fixing ranges to exhibit more consistent matching behavior, and fixing the scalar semantic behavior of quoted sequences. The scalar coalescing fixes also apply to the DSL, such that adjacent scalars in a DSL concatenation are coalesced to produce correct grapheme matching semantics.

  • Scope: Affects runtime matching semantics of custom character classes in regex literals, as well as adjacent unicode scalars values in the DSL.

  • Radar: rdar://97386173

  • Risk: Low/Medium – This change reworks a decent amount of matching engine logic around character classes, however many test cases have been added to ensure correctness. Most of the other changes are generalizations of existing logic.

  • Testing: Added test cases to the repo to ensure the matching behavior is correct.

  • Reviewer: @stephentyrone

This allows us to catch the case where a match
occurs without optimizations, but doesn't occur
with optimizations. Additionally fix the `xfail`
param such that it can't be used on tests that
actually match expectations.
Replace a couple of `#if os(Linux)` checks with
a check to see if we have a newer stdlib
available. This lets us emit an expected failure
in the case where we're testing on an older
stdlib.
Previously we performed a lexicographic
comparison with the bounds of a character class
range. However this produced surprising results,
and our implementation didn't properly handle
case sensitivity.

Update the logic to instead only allow single
scalar NFC bounds. The input is then converted to
NFC in grapheme semantic mode, and checked against
the range. In scalar semantic mode, the input
scalar is checked on its own. Additionally, fix
the case sensitivity handling such that we check
both the lowercase and uppercase version of the
input against the range.
Previously we would emit a series of scalars
written in the DSL as a series of individual
characters in grapheme semantic mode. Change the
behavior such that we coalesce any adjacent
scalars and characters, including those in regex
literals and nested concatenations. We then
perform grapheme breaking over the result, and can
emit character matches for scalars that coalesced
into a grapheme.

This transform subsumes a similar transform we
performed for regex literals when converting them
to a DSLTree. This has the nice side effect of
allowing us to better preserve scalar syntax in
the DSL transform.

rdar://96942688
Previously we would only match entire characters.
Update to use the generic Character consumer logic
that can handle scalar semantic mode.

rdar://97209131
In grapheme semantic mode, coalesce adjacent
character and scalar members of a custom character
class, over which we can perform grapheme breaking.
This involves potentially re-writing ranges such
that they contain a complete grapheme of adjacent
scalars.
Make sure we throw the right error for ranges
that are invalid in grapheme mode, but are valid
in scalar mode.
I also noticed that `lexQuantifier` could silently
eat trivia if it failed to lex a quantification,
so also fix that.
@hamishknight
Copy link
Contributor Author

@swift-ci please test

@hamishknight hamishknight merged commit 4eb3233 into swiftlang:swift/release/5.7 Jul 21, 2022
@hamishknight hamishknight deleted the more-fixes-5.7 branch July 21, 2022 23:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
r5.7 5.7 Release Cherry Picks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants